Rethinking the Whodunnit Approach to Assessing the Quality of Health Care Research -- A Call to Focus on the Evidence in Evidence-Based Practice


When we wrote our March, 2008 editorial in JMCP regarding the lack of evidence of value in value-based insurance design (VBID), 1 we correctly anticipated that those working in the "trenches" of managed care pharmacy would appreciate an objective evaluation of the evidence about prescription drug cost-sharing. What we did not anticipate was a query we received from a consultant about our real reasons for writing the editorial. He asked about our business interest in the topic and why we had not disclosed the financial relationships that must have prompted our work. It was apparently inconceivable that 2 journal editors would closely examine the research evidence underlying a health care management tool for no reason other than an interest in determining the quality of the evidence. Apparently equally unthinkable was the fact that we had no conflicts of interest to disclose.
We wonder if the cynical perspective evident in the question posed to us is unavoidable. We work in an industry in which most research is funded by parties with a commercial interest in the study outcomes. It is easy to conclude that a relationship between economic interest and research findings is inevitable. Adriane Fugh-Berman, author of numerous books and articles about complementary medicine, tells a similar tale: "When a university or medical society invites me to speak, I am often asked which company sponsors my talks. Not whether a company sponsors me, but which company." 2

Why So Cynical? Widely Publicized "Scientists Behaving Badly" 3
Speculation about the relationship between business interest and the conduct of research is understandable given the media attention paid to financially motivated unethical behavior, real or alleged, in the medical research enterprise in the past few years. Richard Smith, who edited the British Medical Journal from 1979 to 2004, described medical journals as "an extension of the marketing arm of pharmaceutical companies in publishing trials that [favor] their products." 4 In April, 2008, reports of excessive pharmaceutical manufacturer influence over publications in several major research journals prompted the editor of the Journal of the American Medical Association (JAMA) to describe the "manipulation" of the research literature as "disgusting." 5-7

Eyebrows were raised when Jeffrey Lisse, the first author of the Assessment of Differences between Vioxx and Naproxen to Ascertain Gastrointestinal Tolerability and Effectiveness (ADVANTAGE) trial (Annals of Internal Medicine, 2003), told a New York Times reporter in 2005 that he had not actually had any role in the trial but had merely lent his name to a manuscript written by the manufacturer. 8 In response, an editorial published in Annals in 2005 declared that the frank admission of ghostwriting had "sent more than a few shivers up the spines of the editors." 8 Sismondo described such practices as "ghost management" (i.e., marketing masquerading as science) of research articles in the medical literature. 9 However, the raised eyebrows were later singed when new allegations emerged from the rofecoxib product liability litigation. Paid consultants for the plaintiffs contended that the manufacturer of rofecoxib not only ghost-managed the research and the ensuing publication but did so in a business practice known as "seeding," in which a drug manufacturer funds a clinical trial not only for research purposes but also to "jumpstart" sales by raising awareness of a drug among likely prescribers. 10 The manufacturer has denied the allegation. 11

Clashes between journal editors and authors over relationships between research findings and commercial interests are not uncommon. In one particularly high-profile example in 2005, a conflict developed between the editors of the New England Journal of Medicine (NEJM) and the authors of the Vioxx (rofecoxib) Gastrointestinal Outcomes Research (VIGOR) trial report (2000) 12 over what the editors described as "inaccuracies in data" about cardiovascular risk. 13 The debate had been prompted by the court-ordered discovery of the manufacturer's internal documents, which suggested that 3 additional myocardial infarctions in the rofecoxib treatment group were known to the manufacturer more than 4 months prior to publication but not reflected in the report's data analysis. 13 The VIGOR trial authors not employed by the manufacturer countered that because the infarctions had occurred after an a priori data collection and unblinding cutoff date, it would have been inappropriate to include them in the analysis. 14 A subsequent (2008) analysis of the same manufacturer's internal documents indicated that elevated mortality risk associated with rofecoxib treatment (pooled hazard ratio [HR] of 2.99; 95% confidence interval [CI] = 1.55-5.77) was known in 2001 but was not reported to the U.S. Food and Drug Administration (FDA) in a timely fashion or clearly disclosed in study reports published in 2004 and 2005. 6 The manufacturer has denounced the allegation that it failed to disclose known risks as "false and misleading." 7

The journal peer review process, widely considered to be a hallmark of scientific integrity and quality, has also recently found itself the unwitting victim of bad behavior. In January, 2008, testimony before the U.S. Senate Finance Committee suggested that a prominent diabetes researcher, who had been reviewing a manuscript for NEJM, allegedly violated peer review confidentiality rules by leaking the manuscript to a pharmaceutical manufacturer that was likely to be negatively affected by its publication. 15 A less-publicized but fascinating incident occurred at the Journal of General Internal Medicine in 2004 when editors sent a manuscript to peer review. Coincidentally, one of the assigned peer reviewers had been asked in the previous year to lend her name to it even though she had not done any of the work; the actual author was a medical education company employee. 2 When the reviewer realized that the medical education company had successfully recruited another named author for the ghostwritten work, she reported the deception to the journal's editors, who rejected the paper, implemented new authorship disclosure policies, and publicly expressed concerns about the publication of literature that "injects bias and untruth into the scientific dialogue in order to enhance corporate profits." 16

Ghost Writing, Truth-Burying, and Other Bad Behavior: The Trees, Not the Forest
Although allegations of ethical lapses (such as inaccuracies in authorship lists, deliberate failure to disclose drug safety issues, or information leaks) are important and make for salacious reading, focusing solely on these issues will draw our attention away from a broader and more important question: In assessing the credibility of research information, are decision-makers limited to the simplistic foregone conclusion that financial interest will dictate research outcomes irrespective of actual evidence? Are decision-makers' evaluations of what they can believe limited to a superficial assessment of whom they can believe? Or can they turn to more objective standards in assessing the quality of research information?
To address this broader question, we begin by examining current evidence of a relationship between financial interest and research outcomes that has caused justifiable concern among many observers. We explain why the approach taken in many proposed solutions to the problem is likely to be ineffective. We propose as an alternative a more direct examination of quality of evidence, using available standards for conducting and reporting research. After reviewing key research standards and discussing evidence of widespread nonadherence to these basic standards, we propose a way forward.

Financial Interest and Research Results: Empirical Evidence
The view that financial interest ipso facto equates to results is supported by evidence closely linking study sponsorship to reported research outcomes. In a meta-analysis of 324 randomized controlled trials (RCTs) of cardiovascular devices and drugs published between January, 2000 and July, 2005 in 3 prominent journals (JAMA, Lancet, and NEJM), Ridker and Torres identified a strong relationship between funding and positive outcomes for newer treatments. Of 205 drug trials, the proportions finding clinical superiority for the newer treatment were 39.5% for those with not-for-profit funding, 54.4% for those that were jointly funded, and 65.5% for those funded by for-profit companies (P = 0.002). 17 In an analysis of 504 studies of inhaled corticosteroids, Nieto et al. found a strong inverse relationship between pharmaceutical manufacturer funding and the likelihood of identifying adverse effects for a study drug; 34.5% of 275 studies funded by pharmaceutical manufacturers versus 65.1% of 229 studies with other funding reported that the study drug of interest had significantly higher rates of adverse effects than its comparators. 18

Survey data provide direct evidence that scientists perceive, and sometimes respond to, pressure from their funding sources. In a 2002 survey of 3,427 scientists who had received financial support from the National Institutes of Health, 15.5% admitted to "changing the design, methodology, or results of a study in response to pressure from a funding source" in the previous 3 years. Although only 0.3% admitted to "falsifying or 'cooking' research data," 13.5% said that they had used "inadequate or inappropriate research designs," and 10.8% acknowledged "withholding details of methodology or results in papers or proposals." 3 The problem was, according to the scientists surveyed, not limited to their own work; 12.5% of the survey respondents admitted to "overlooking others' use of flawed data or questionable interpretation of data." 3
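To make the size of the inhaled corticosteroid difference concrete, the following is a minimal sketch that rebuilds approximate study counts from the reported percentages (34.5% of 275 and 65.1% of 229) and computes a risk ratio and a pooled two-proportion z statistic. The reconstructed counts are an assumption based on rounded percentages, the function name is hypothetical, and the z-test is only an illustration; it is not the analysis performed by Nieto et al.

```python
import math

def two_proportion_summary(p1, n1, p2, n2):
    """Compare two reported proportions.

    Counts are reconstructed from rounded percentages, so they are
    approximations (an assumption of this sketch, not source data).
    """
    x1 = round(p1 * n1)  # approximate number of studies reporting the outcome, group 1
    x2 = round(p2 * n2)  # approximate number of studies reporting the outcome, group 2
    r1, r2 = x1 / n1, x2 / n2
    risk_ratio = r1 / r2
    # Pooled standard error for a two-proportion z-test
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (r1 - r2) / se
    return {"x1": x1, "x2": x2, "risk_ratio": risk_ratio, "z": z}

# Manufacturer-funded studies versus studies with other funding
summary = two_proportion_summary(0.345, 275, 0.651, 229)
print(summary)
```

Under these assumptions, manufacturer-funded studies were roughly half as likely as other studies to report significantly higher adverse effect rates for the study drug.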

AHRQ
Compilation and synthesis of 121 sources (scales, checklists or guidance documents) that "can be used in the production of systematic evidence reviews and technology assessments." Systems were characterized with respect to "important domains and elements" that safeguard against bias or are "long-accepted" research practices.

ISPOR Medication Compliance and Persistence Special Interest Group
Checklist of items to be "included, or at least considered," in retrospective database analyses of medication compliance or persistence. Includes definitions of commonly used terms. Refers to the ISPOR Retrospective Database Checklist for detailed standards to be used in assessing work derived from administrative claims databases.

ISPOR working group
Checklist to assist decision makers in assessing quality of analyses of administrative databases; covers numerous areas including relevance, reliability and validity, methodological considerations unique to claims databases, statistical analysis and interpretation.

EQUATOR network
Consistent with CONSORT; explicitly addresses issues common in nonrandomized studies that examine the effects of interventions.

EQUATOR network
Standards for the reporting of randomized controlled trials including study rationale, design, measurement, analysis, results, limitations, and discussion; includes citations to 204 references that can be accessed for additional information and examples. Main focus is validity (internal and external).

ISPOR working group
Standards for conducting and reporting analyses of economic end points incorporated into clinical trials.

STATISTICAL ANALYSES AND MODELS

Chicago Guides 37,38
Chicago University Press
Comprehensive textbook standards for data analysis and reporting.

Guidelines for the conduct and reporting of cost-effectiveness analyses.

ISPOR Task Force
Comprehensive, detailed quality assessment document for decision analytic modeling. Emphasizes transparency, sensitivity analyses, empirically-based assumptions, and incorporation of new evidence as it becomes available.

Quality of Health Economic Studies (QHES) 41
Panel of 8 health economics experts; validated in a survey of 30 clinicians and 30 health economists
Quality scoring system for cost-minimization, cost-effectiveness and cost-utility analyses, based on 19 previous guidelines and checklists and validated against expert assessment of quality. Primarily focused on technical details specific to cost-effectiveness (e.g., time horizon, perspective, economic model structure).

OTHER RESEARCH TYPES

Patient Reported Outcomes (PRO) Harmonization Group report 42
Representatives from ISOQOL, ISPOR, PhRMA-HOC, and ERIQA
Evolving standards for instruments and reporting of patient-reported data.

Quality of Reporting of Meta-analyses (QUOROM) 43
EQUATOR network
Similar to CONSORT in concept, but addresses issues specific to meta-analyses of RCTs.

crucial to evaluation of the quality of research evidence. 19 For our part, the JMCP disclosure forms for authors have been revised many times in response to instances of apparent ghost writing or ghost management. These JMCP disclosure forms require the corresponding author to certify the percentage contribution of every person who contributed in any way to the manuscript, including study concept and design, data collection, data interpretation, writing and revising the manuscript. 20

In 2001, the editors of JAMA supplemented the journal's enhanced disclosure policies with an additional response to the problems of "data misrepresentation … and selective reporting." 21 Submissions of manuscripts reporting industry-sponsored studies must be accompanied by certification that the data analysis was conducted by "an academic statistician who is not employed by the sponsor and who is at an academic center, such as a medical school, or is an employee of a government research institute." This approach was intended to provide "an additional layer of oversight for the integrity of the data analysis and reporting" and to rely on academic centers' usual "mechanism for investigation" of possible improprieties. JAMA's editors expressed considerable confidence in their policy change: "If all journals would have similar policies," they opined, "the likelihood of manipulation of data, inappropriate data analysis and selective reporting of results could be substantially decreased." 21

Replacing the "Whodunnit" Approach: A Sharper Focus on Quality of Evidence
These proposed solutions focus primarily on "whodunnit" questions: who paid for the study, who conducted it, who interpreted the data, who wrote the report, and who enforced ethical standards. Although full disclosure of financial interests and authorship arrangements is clearly important, addressing the health care research literature's current problems through the "whodunnit" approach alone is unlikely to be effective.
The requirement that analyses be conducted by an academician instead of a for-profit company employee misses 3 key points: first, that high-quality work can be, and often is, produced by for-profit companies; second, that academicians, like all of us, are fallible human beings who can make errors or be influenced by bias; and third, that academicians are potentially just as vulnerable as other scientists to funding source pressures. As Ridker and Torres point out in their meta-analytic assessment of research with for-profit and nonprofit sponsorship, design problems in trials conducted in academic settings have been observed in the literature, 17 such as in a 2006 investigation of high-dose statin therapy that was led by a medical school academician and published in a prominent journal despite having no control group. 17,22 Additionally, Ridker and Torres observe that even academicians may "nevertheless be under pressure to present results in the best possible light, in particular emphasizing favorable subgroup analyses when an overall neutral effect was observed for the primary end point." 17 To their point, Mello et al. (2005) surveyed 107 medical school research administrators who had authority over contracts with industry research sponsors for the conduct of clinical trials. Respondents were presented with a list of potential contract provisions and asked which would be acceptable or unacceptable to them. Allowing sponsors to "alter the study design after the agreement is executed" was viewed as acceptable by 62% of respondents and unacceptable by 27%. Allowing sponsors the right to insert their own statistical analyses in manuscripts was seen as acceptable by 24% of respondents; 47% said that this provision would be unacceptable and 29% were not sure. 
A remarkably high 50% of the respondents said that they would permit the sponsor to "write up the results for publication and the investigators may review the manuscript and suggest revisions," although 40% viewed such a provision as unacceptable. 23

Thus, an information consumer whose only approach for assessing quality of evidence is identifying who is "good" (e.g., who has pure motives, good training, etc.) is hopelessly relegated to making judgments with inadequate information, thereby inevitably rejecting some high-quality work and accepting some poor-quality work. We embrace a different and more effective approach. Instead of looking at the motives or quality of people, an exercise that is fraught with inherent uncertainty, we propose a more intense focus on the quality of work output using currently available standards for the conduct and reporting of research. Information consumers who rely on basic research quality standards will be in a strong position to distinguish poor-quality from high-quality work, no matter who performed or paid for it.

Research Standards: Are They Usable?
The report of a roundtable discussion of the use of real-world data in managed care decision-making, published in the April, 2008 issue of JMCP, observed that standards for the conduct and reporting of research are seldom used, perhaps because they are either not clearly available, not clearly understood, or both. 24 Confusion over the availability and utility of research standards is understandable, since a "Google" search on the term "research standards" yields dozens of entries ranging from vague admonitions to conduct scientifically valid and ethical work to detailed "how-to" publications.
Our analysis of the available tools to assess evidence quality examined documents that specifically address a question that is critically important to decision-makers: how to distinguish credible from less credible health care research results. Of those documents (Table 1), 25-44 many address technical issues relevant to only a single type of work (e.g., decision analytic modeling, studies of diagnostic test sensitivity and specificity) or to specific problems (e.g., ghost authorship, sponsorship ethics) rather than principles broadly applicable to all types of health care research. The documents included in our analysis (shown in bold print in the table) met the following criteria: (a) Address the question of how to conduct and/or report research.
Research reporting is especially critical because publications are typically a decision-maker's only potentially valid and reliable source of information about the conclusions reached by study authors. No information consumer should rely solely on popular press accounts. (b) Provide specific guidance rather than vague admonitions. For example, the advice that study groups should be appropriate is accurate but not especially helpful to a decision-maker unfamiliar with research methods. In contrast, the guidance that study groups should be comparable with respect to demographic and clinical characteristics, including age, gender, payer type, industry type, benefit design, and comorbidities, is sufficiently specific to be usable. (c) Address quality measures that are applicable to many types of health care research rather than being limited to specific technical issues. For example, the Quality of Health Economic Studies instrument, a quality-assessment checklist for cost-effectiveness analyses, is not included in our detailed review because it addresses primarily technical issues (e.g., perspective of analysis, time horizon, discounting method) but not broader issues (e.g., how a literature review should be performed and reported or how to present implications of study findings for policy or future research). 41 (d) Use is complementary with other guidelines (e.g., the AMCP dossier format for formulary submissions or the International Committee of Medical Journal Editors (ICMJE) standards for manuscript submissions). 27 For example, a managed care decision-maker assessing material for dossier inclusion or an author assessing studies for inclusion in a literature review may need to refer to detailed standards for quality of research evidence.
Although our review was not intended to be comprehensive of every research guideline available, it covers respected organizations and information sources frequently used by researchers in the managed health care field. These include: (a) University of Chicago Press reference guides to writing about numbers 37 and about multivariate analyses 38 (these 2 books provide virtually identical guidance); (b) International Society for Pharmacoeconomics and Outcomes Research (ISPOR) code of ethics and standards for conducting and reporting studies of medication compliance and persistence, cost-effectiveness analyses, decision analytic models, patient-reported outcomes, and retrospective analyses of administrative claims; 29-31,36,40,42 (c)

Key Principles in Conducting and Reporting Research
Transparency
• Study selection criteria are specified in detail
• Quantitative effect of each sampling criterion on sample size is disclosed explicitly; a sample derivation flow chart or table is preferred
• Details of intervention and comparison conditions are described explicitly
• Methods are described in sufficient detail to enable replication
• Specific contribution of each author and other contributors is disclosed a

Hierarchy of evidence quality
• To maximize internal validity (accurate causal inference), design minimizes the effects of confounding
• Randomized designs are ideal; quasi-experimental is acceptable
• Results of studies with simple pre-post or cross-sectional designs are given lesser weight

Quantitative exploration of possible bias
• Report includes a demonstration (not just assertion) that study groups are comparable
• Studies of claims databases address potential confounding factors specific to that data source
• Possibility of bias is explored using sensitivity analyses on alternative methodological decisions or decision model inputs

Basic logic
• Study report reflects a basic logical and evidence-based connection, or "plausible mechanism," between input and outcome
• Interpretation does not confuse association with causation (e.g., white hair does not increase the risk of mortality) b
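The transparency principle that the quantitative effect of each sampling criterion be disclosed can be illustrated with a short sketch that builds a sample derivation table. The patient records, field names, and criteria below are hypothetical and chosen only for illustration; they do not come from any of the guideline documents.

```python
def attrition_table(records, criteria):
    """Apply criteria in order and report how many records each step removes."""
    rows = [("Starting sample", len(records), 0)]
    remaining = list(records)
    for label, keep in criteria:
        kept = [r for r in remaining if keep(r)]
        rows.append((label, len(kept), len(remaining) - len(kept)))
        remaining = kept
    return rows, remaining

# Hypothetical records (illustrative only)
patients = [
    {"id": 1, "age": 45, "has_diabetes": True, "months_enrolled": 12},
    {"id": 2, "age": 17, "has_diabetes": True, "months_enrolled": 12},
    {"id": 3, "age": 60, "has_diabetes": False, "months_enrolled": 12},
    {"id": 4, "age": 52, "has_diabetes": True, "months_enrolled": 7},
    {"id": 5, "age": 38, "has_diabetes": True, "months_enrolled": 12},
]

# Each criterion is a (label, keep-function) pair, applied in sequence
criteria = [
    ("Age 18 or older", lambda r: r["age"] >= 18),
    ("Diabetes diagnosis", lambda r: r["has_diabetes"]),
    ("Continuously enrolled 12 months", lambda r: r["months_enrolled"] >= 12),
]

rows, final_sample = attrition_table(patients, criteria)
for label, n_remaining, n_excluded in rows:
    print(f"{label:<35} remaining={n_remaining} excluded={n_excluded}")
```

Reporting the remaining and excluded counts at every step lets a reader see which criterion shaped the final sample most, which is the purpose of the flow chart or table the guidelines prefer.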

Explicit description of study inclusion and exclusion criteria
Study selection criteria are explicit (e.g., list used for selection, specific ages and diagnoses, location, type of clinic, etc.); description should be specific enough to enable reader to determine applicability to his/her population 26,30,31,33,35,37-39

DISCUSSION AND CONCLUSIONS Association is distinguished from causation
The process that produces an association between variables is explored (e.g., a "plausible mechanism," "causal pathway," or "biologic credibility.") Report recognizes that findings of association(s) between variables could be spurious 31  Similarly, to enable the reader to assess how sampling procedures could have affected study validity, most guidelines emphasize that the quantitative effect of each sampling criterion on sample size should be disclosed explicitly. 30,31,[33][34][35]37,38 CONSORT and STROBE guidelines recommend a sample selection flowchart, 33-35 a JMCP requirement (e.g., Figure 1 in Stockl et al.) 45 Some but not all guidelines recommend comparisons (e.g., on demographic and clinical characteristics) of cases included in the sample with either cases excluded from the sample or benchmark data (e.g., data for a sample of patients with diabetes could be compared with similar nationwide data from the American Diabetes Association). 26,31,33,34,[37][38][39] Guidelines for studies of patient-reported outcomes include reporting of the participation rate, such as a survey response rate. 42 Guideline documents also recommend complete disclosure of the details of interventions and outcomes, including time horizon, measurement method, and codes used (e.g., International Classification of Diseases, Ninth Revision, Clinical Modification [ICD-9-CM], Current Procedural Terminology, and Health Care Common Procedural Coding System codes). Although the specific content of recommendations varies by type of study, the principle of complete disclosure is remarkably consistent. For example, in RCTs of drug treatments, standards call for disclosure of dose, administration schedule, procedures used to blind participants Table 2 shows key principles reflected in the guideline documents. Table 3 contains a detailed quality checklist summarizing the guidance provided by each information source, sorted approximately in the order of sections in a typical research paper. 
Key principles include:

Key Principle: Transparency
All sources emphasize transparency in reporting every phase of a research project. First, study selection criteria should be specified in sufficient detail that readers can clearly determine the applicability of the study sample to their populations. For example, the results of a benefit design intervention study conducted in a hospitality industry sample consisting primarily of young adults paid at minimum wage probably do not inform the expected results of the same intervention implemented in an information technology company consisting of highly paid middle-aged white collar executives and computer programmers. Similarly, results obtained in primary care clinics serving low-income patients are unlikely to predict outcomes in a clinic that accepts only private insurance. For this reason, guidelines that discuss sampling procedures state that inclusion and exclusion criteria -such as age range, diagnoses, period of time for qualification into study, location, setting, or any factor that potentially could influence outcomes -should be specified explicitly. 26,30,31,33,35,[37][38][39] Rethinking the "Whodunnit" Approach to Assessing the Quality of Health Care Research -A Call to Focus on the Evidence in Evidence-Based Practice cal characteristics and other potentially confounding factors is recognized in guidelines as important. 26,31,[33][34][37][38] Potential confounders vary by study topic, but would include, for example: (a) baseline compliance in a study of whether a disease management program affects medication persistency; (b) comparability of drug formularies in a study of whether particular benefit design features affect patient outcomes; or (c) baseline drug expenditure in a study of the effects of an intervention on payer or out-ofpocket cost. In the AHRQ's assessment of systems to rate the strength of scientific evidence, all 12 instruments that were used to grade quality of observational research included comparability of subjects as a standard. 
25 Thus, guidelines specify the need for a descriptive table comparing study groups on relevant factors. 26,31,[33][34][37][38] For the author to merely state, rather than show, that study groups are comparable is recognized as inadequate. Additionally, guidelines for analyses of claims databases identify possible confounding factors that are specific to that data source. 30,31 First and most important is that eligibility for health insurance benefits must be addressed, either by limiting analyses to members continuously eligible for insurance benefits during the study period or by using outcomes measures that adjust for eligibility (e.g., cost per month of enrollment). Second is an assessment of whether features specific to particular health plans could affect data quality or capture. For example, in a study of cost and utilization of mental health services, the researchers must address whether a "carveout" arrangement with an external vendor for the provision of some or all mental health services could have affected the capture of claims in the dataset. When dealing with samples in which under-reporting of data is common, such as patients with mental illness or enrollees who are aged 65 or older and therefore Medicare-eligible, researchers should describe explicitly how potential concerns about data capture and completeness have been addressed.
Although all guidelines recognize the importance of validity, the authors of the CONSORT guidelines are particularly dismissive of the idea that departures from internal validity (i.e., accuracy in making causal inferences) are acceptable if they enhance external validity (i.e., the degree to which the data represent "realworld" conditions): "Internal validity is a prerequisite for external validity: the results of a flawed trial are invalid and the question of its external validity becomes irrelevant." 35

Key Principle: Quantitative Exploration of Possible Bias
Closely related to the understanding of hierarchy of evidence is the admonition in guideline documents that researchers should explore in quantitative fashion the possibility of alternative explanations for study findings. 26,31,33-35,37,38 For example, in an observational study that has found higher medication persistence rates for Drug X than for Drug Y, the researcher should explore the possibility that the patients treated with Drug X were more compliant at baseline (e.g., by examining history of compliance with other therapeutic classes in the pre-treatment year). In a
and investigators to placebo and active treatment, and qualifications of personnel administering the intervention. 26,35 In drug benefit studies, guidelines call for disclosure of benefit design details, such as formularies, step therapy and prior authorization requirements, carveouts, mail order incentives, or any other arrangements that could potentially affect study findings. 31 TREND guidelines for intervention evaluation studies employing nonrandomized designs describe as "critical" the reporting of "sufficient detail so that a reader has an understanding of the content and delivery of both the experimental intervention and the services in the comparison condition." 34 Guideline sources, particularly those that address calculation methodologies in detail, are consistent with the ICMJE standard that statistical methods should be described "with enough detail to enable a knowledgeable reader with access to the original data to verify the reported results." 27 Both the USPHS recommendations for cost-effectiveness analyses and the ISPOR guidelines for decision analytic models include specification of all input values, data sources, and calculation methods. 39,40 CONSORT authors effectively summed up the guidance on transparency: "Readers should not have to speculate; the methods used should be transparent, so that readers can readily differentiate trials with unbiased results from those with questionable results. Sound science encompasses adequate reporting, and the conduct of ethical trials rests on the footing of sound science." 35

Key Principle: Hierarchy of Evidence Quality
Guideline documents that address research design uniformly recognize that a key element in evidence quality is, as an AHRQ working group described it, "the extent to which all aspects of a study's design and conduct can be shown to protect against systematic bias, nonsystematic bias, and inferential error." 25 Key to this assessment is the minimization of confounding, defined as the degree to which subjects in intervention and control (or comparison) groups differ on a factor, other than the intervention, that might affect the study outcome(s). Randomization of study groups is widely recognized as the "gold standard" method to ensure the absence of confounding. 35 Quasi-experimental designs, such as those that compare pre-intervention versus post-intervention outcomes across 2 groups (i.e., a group that receives the intervention vs. a similar group that does not), are recognized as a reasonable alternative to randomization. Cross-sectional designs, which compare outcomes for nonrandomized groups during a single time period, and designs that compare pre-intervention versus post-intervention outcomes without a comparison group are recognized as more prone to producing invalid results because they are subject to confounding effects (e.g., unmeasured group differences or changes that would have occurred over time even without the intervention). This recognition is clear both in guideline documents and in methods textbooks. 26,31,35,37-39,46 For nonrandomized designs, a demonstration that study groups are comparable with respect to demographic and clini-
Presentation of an evidence-based logical pathway between input and outcome also helps the reader feel confident in another aspect of study design that is widely recognized as important: the use of a priori rather than post hoc methodologies. 29,31,33-36,42 For example, if the research literature uniformly defines Disease X as a 3-month history of symptoms, a researcher who, without explanation, selects a study population based on a 6-month symptom history opens the analysis to criticism. Without a clear evidence-based explanation for the sample selection criterion, it appears to readers that methods were hand-picked to produce a desired answer rather than specified in advance.

Key Principle: Meaning and Limitations of Statistical Methodologies
Guideline documents, especially those that specifically address the topic of statistical methods, emphasize in several ways the dangers of over-interpreting or misinterpreting statistical analytic results. A first key example, cited by every guideline source, is the need to distinguish between statistical significance and clinical or practical significance. 26,30,31,33,35,37-39,42 In analyses with very large sample sizes, which are common in retrospective analyses of claims databases, virtually any result will be statistically significant; however, practical significance must be assessed carefully using absolute, rather than relative, measures of effects on outcomes. For example, in a study comparing medication possession ratios for Drug X versus Drug Y with sample sizes of 30,000 in each group, very small differences of less than 0.5 percentage points, or about 1.5 days of pharmacotherapy per year, will be statistically significant but have no practical meaning whatsoever. Conversely, some studies may be underpowered because of insufficient sample size; for example, a 1-year persistency rate of 67% for Drug X versus 50% for Drug Y, with sample sizes of 30 in each group, suggests a possibly clinically important difference that has not been tested adequately. Using a 2-tailed test and alpha (Type I error) of 0.05, the number necessary for 80% power to detect a difference of 67% versus 50% is 130 in each group. Thus, with only 30 patients per group, a finding of no significant difference between the groups is not interpretable.
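The sample-size figure in the underpowered-study example can be checked with a short calculation. The sketch below is illustrative only: it assumes the common normal-approximation formula for comparing 2 independent proportions (pooled variance under the null hypothesis, no continuity correction), which the text does not specify, and the function name `n_per_group` is hypothetical. Under those assumptions the result lands close to the 130 per group cited above; other formulas give somewhat different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a 2-tailed test of two proportions.

    Assumption: normal approximation with pooled variance under the null
    and no continuity correction (one of several textbook formulas).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, 2-tailed test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2                          # pooled proportion under the null
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Persistency rates from the example in the text: 67% versus 50%
required = n_per_group(0.67, 0.50)
print(f"required per group: {required}; available per group: 30")
```

A study with 30 patients per group falls far short of this requirement, which is why its "no significant difference" result says little about whether a real difference exists.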
Additional admonitions common to guidelines for statistical analyses address a common phenomenon in health care research, the use of multivariate analyses to adjust for baseline differences in studies with observational designs. Guidelines specify that: (a) descriptive analyses of study outcomes should always be presented, (b) multivariate analyses should be presented only if necessary, and (c) multivariate analyses should never be presented without accompanying descriptive statistics. 33,35,37,38 While recommendations for the content of descriptive analyses vary slightly by source, Chicago Guide recommendations include mean and standard deviation, or median and interquartile range for skewed data, minimum and maximum (range), and counts of subjects or patients included in each analysis. 37,38 Finally, guidelines emphasize that researchers should report
study in which ICD-9-CM codes of 296.2 (major depression, single episode) and 296.3 (major depression, recurrent episode) were used to select patients with depression, sensitivity analyses should explore the use of other diagnostic codes, such as 311 (depressive disorder not elsewhere classified), because of the possibility that diagnoses of major depression were deliberately or inadvertently miscoded in the bill submitted by the provider for payment. Readers should be especially alert for, and suspicious of, research in which multiple definitions or approaches could reasonably have been used, but the study authors have chosen just 1 of these and performed no sensitivity analyses to assess the effects of their choices. The need for sensitivity analyses is featured prominently in guideline documents for decision analytic and cost-effectiveness modeling. 39,40 ISPOR's recommendations for best practice in modeling studies indicate that "model results should never be presented as point estimates or as unconditional claims of effectiveness or cost," but instead should be "represented as conditional upon the input data and assumptions, and they should include extensive sensitivity analysis to explore the effects of alternative data and assumptions on the results." 40

Key Principle: Basic Logic
A common theme that emerges from review of the guideline documents is that of a logical and evidence-based connection or "plausible mechanism" between input and outcome; findings should have "biologic credibility." 31,33-38 Guidelines for decision analytic models emphasize the need for an evidence-based "event pathway," explicitly showing the reader the relationships among study variables to make the rationale for the calculations clear. 39,40 in writing about the ISPOR recommendations for modeling studies, note that a model's value "lies not only in the results it generates, but also in its ability to reveal the logical connection between inputs (i.e., data and assumptions) and outputs in the form of valued consequences and costs." Thus, readers should be able to understand the "logic behind [the model's] results … at an intuitive level." 40 Guidelines for observational research emphasize that a plausible event pathway is especially critical to avoid confusion between association and causation. In a marvelous tongue-in-cheek example of how not to interpret observational data, statistician Jane Miller points out that a naïve researcher, noting that the mortality rate among people with white hair is higher than the rate for those whose hair has remained its original color, might write something like this: "White hair increased the risk of dying by 400%." 38 Extending Miller's example to an error commonly committed in observational analyses in health care, one might see a naïve researcher report that, given the association between white hair and elevated risk of hospitalization and death, we could save millions of dollars and hundreds of thousands of lives each year if only everyone of a certain age would simply make a small investment in the cost of hair coloration products.
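The white-hair example can be made concrete with a small stratified calculation. The sketch below is illustrative only: every count in it is hypothetical and invented for the demonstration, and the names `cohort`, `risk`, and `crude_risk` are not drawn from any cited source. It shows how a crude comparison can suggest a large "effect" of white hair that disappears once the comparison is made within age groups, because age drives both hair color and mortality.

```python
# Hypothetical counts, invented purely for illustration: (people, deaths)
cohort = {
    "older":   {"white_hair": (900, 90), "original_color": (100, 10)},
    "younger": {"white_hair": (100, 1),  "original_color": (900, 9)},
}


def risk(people: int, deaths: int) -> float:
    """Proportion of people who died."""
    return deaths / people


def crude_risk(group: str) -> float:
    """Mortality risk for a hair-color group, ignoring age."""
    people = sum(stratum[group][0] for stratum in cohort.values())
    deaths = sum(stratum[group][1] for stratum in cohort.values())
    return risk(people, deaths)


# Crude comparison: pools both age groups, so age is left uncontrolled
crude_ratio = crude_risk("white_hair") / crude_risk("original_color")

# Stratified comparison: the same ratio computed within each age group
stratum_ratios = {
    age: risk(*groups["white_hair"]) / risk(*groups["original_color"])
    for age, groups in cohort.items()
}

print(f"crude risk ratio: {crude_ratio:.2f}")
for age, ratio in stratum_ratios.items():
    print(f"{age} risk ratio: {ratio:.2f}")
```

In these made-up data the crude ratio is well above 4, yet within each age group the risk is identical for both hair colors, so the apparent association reflects confounding by age rather than any effect of hair color.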
has become an impetus for the development of guideline documents. In a particularly troubling example, the TREND group was formed in 2003 after a group of researchers tried unsuccessfully to synthesize multiple studies of behavioral interventions in patients with human immunodeficiency virus. So many study reports had "failed to include critical information (e.g., intervention timing and dosage, effect size data)" that the researchers were unable to complete their work. 34 Flaws of this type may have clinical consequences. A discussion of the Grading of Recommendations Assessment, Development and Evaluation (GRADE) system for ranking types of evidence notes that the recommendation of hormone replacement therapy to reduce cardiovascular risk in post-menopausal women, which had been "dutifully applied" by primary care physicians for about a decade, was based on observational research that ultimately was shown to have produced erroneous findings. "Had a rigorous system of rating the quality of evidence been applied at the time," noted the GRADE observers in 2008, "it would have shown that because the data came from observational studies with inconsistent results, the evidence for a reduction in cardiovascular risk was of very low quality." 49 In a similar sequence of events described by Jefferson and Di Pietrantonj in 2007, studies conducted in the decades prior to 2006, employing mostly observational designs, documented a widely accepted association between use of influenza vaccines in persons aged 65 and older and reduced all-cause death rate. When Jefferson and Di Pietrantonj pointed out that vaccine use was associated with only the all-cause death rate, not the death rate from pneumonia or influenza, making the association biologically implausible and probably due to confounding, they became the target of "much support and some scorn … Eminent immunologists told readers that our interpretations were evidently false." 50 Yet, in 2007, a review of the evidence concluded that the observational study results were attributable to selection bias, and sided with Jefferson and Di Pietrantonj. 50 These examples are unfortunately not unique; a 1999 examination of the methodological quality of clinical practice guidelines found that only 13% of the guideline development documents had graded their recommendations according to strength of the supporting evidence. 51

Problems in Quality of Evidence: Preventable with Insistence on Better Reporting
Perhaps the best evidence that quality problems are preventable comes from research studies in which the authors directly reported data that either called the authors' conclusions into question or entirely refuted them, yet the data were not reflected in the interpretation appearing in the study report.
For example, in a comparison of entecavir with lamivudine in the treatment of hepatitis B, Chang et al. reported that the study's primary end point outcome, the rate of "histologic improvement," was higher for entecavir than for lamivudine (72% vs. 62%;
degree of precision around statistical estimates, such as 95% confidence intervals in analyses of randomized trials or observational data, or "bootstrapping" estimates of uncertainty for decision analytic models. 26,30,33,35-40 All sources also note the importance of recognizing the possibility of unmeasured confounding factors. 26,31,33,36-39 Thus, reports of multivariate models should include estimates of model adequacy, such as goodness of fit measures or a measure of predictive accuracy (e.g., R-squared for linear regression). 31,38

Consistent But Not Consistently Used: Problems in the Published Literature
The remarkable consistency in the content of published guidelines is matched by an equally remarkable inconsistency in adherence to them. In 2002, a review of the quality of evidence in the medical literature described the problems of "poor methodology and reporting" as "widespread." 47 The review documented that the prevalence rates of serious methodological errors in published randomized trials in multiple medical disciplines were high. For example, 25% of 364 reports in surgery journals failed to specify eligibility criteria for study inclusion. The percentages of articles failing to report the intervention allocation method were 89% of 196 reports in rheumatoid arthritis journals, 48% of 206 reports in obstetrics/gynecology journals, and 44% of 80 reports in general medical journals. Of 196 reports in rheumatoid arthritis journals, 63% analyzed multiple observations using inappropriate statistical techniques, and of 50 reports in general medical journals, 58% used incorrect techniques in comparing subgroups. 47 Similar problems have been identified in the observational research literature. The STROBE group reported that in an analysis of observational studies of stroke, 17 of 49 reports did not specify eligibility criteria for study inclusion. 33 In a review of studies that used multivariate statistical methodologies, only 93 of 169 articles (55%) stated clearly how the variables were entered into the model. A survey of published epidemiologic studies found that "some information regarding" participation rates was provided in only 59% of case-control and cross-sectional studies, and only 32% of cohort studies. 33 A study of 132 articles published in cancer journals found that only about one-half provided information about length of follow-up. 33 A recent study of abstracts of 222 articles published in leading medical journals found that when relative risks were reported, absolute risks were reported in 62% of RCT abstracts but only 21% of cohort study abstracts. 33 In Pocock et al.'s (2004) study of epidemiological publications in prominent journals, most exposures (e.g., dietary habits, lifestyle factors, or genetic markers) were grouped into categories without any stated rationale for the groupings, and most articles failed to explain the variables used to adjust for confounding; the authors described as "serious" the "risk that some epidemiological publications reach misleading conclusions." 48 In some instances, the poor quality of the extant literature
P = 0.009). 52 However, the study's primary data table showed that the percentages of patients with "no improvement" did not significantly differ by drug, and the counts of the "histologic improvement" and "no improvement" subgroups did not sum to the total number of study patients. A footnote to the data table explained the discrepancy; the denominators used in the calculations included patients who had no follow-up biopsy specimens, while the numerators excluded them. Excluding patients with missing follow-up specimens from both the denominator and the numerator, the percentages of cases with histologic improvement were 77% for entecavir and 72% for lamivudine, yielding a non-significant P value of 0.204 for the comparison (Fisher's exact test). Nonetheless, the study abstract reported superiority for entecavir. 53 Yank et al. reported a similar problem in their systematic review of meta-analyses of antihypertensive medications that had been published from 1983 to 2004.
When they compared studies funded by a single drug manufacturer with those funded by other sources (i.e., multiple funding sources, non-profit, and undisclosed), a financial tie to a drug manufacturer was not significantly associated with favorable results (i.e., objective findings, odds ratio [OR] = 0.65, P = 0.25). However, manufacturer funding was significantly associated with favorable conclusions (i.e., interpretation, OR = 4.09, P = 0.016). Yank et al. opined that their results reveal "a failure of peer review. Both editors and peer reviewers must have read manuscript versions of those meta-analyses containing discordant results and conclusions, yet they did not prevent publication of biased conclusions." 54 Notably, as an example of the need for data analyses by academic statisticians, the JAMA editorial on pharmaceutical industry influence cited a secondary analysis of a 2005 trial of the use of rofecoxib in patients with mild cognitive impairment. 6,21,55 As the secondary analysis noted, the 2005 report did not reflect the higher all-cause mortality rate for rofecoxib compared with placebo, even though this difference had been disclosed in internal manufacturer documents about 4 years previously. 6 Yet, examination of the original study report suggests problems in reporting and calculation. The sample selection flow chart shows the numbers of patients in each group who experienced an "adverse event," but does not show how many of those events were fatalities. Additionally, the report's text discloses that in the "off-drug follow-up" period, for which data were available for "less than half of the patients," the death rates were 4.8% (17/356) for rofecoxib and 1.6% (5/307) for placebo; the statistical significance of this difference (Fisher's exact test, P = 0.029) was not reported. 55
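Readers who wish to check a comparison such as the off-drug follow-up death rates (17/356 vs. 5/307) can do so from the published counts alone. The sketch below is illustrative only: it assumes the usual 2-sided definition of Fisher's exact test (summing the probabilities of all tables no more likely than the observed one, with margins fixed), and the function name `fisher_exact_two_sided` is hypothetical. It is not the calculation used in the original report or in the secondary analysis.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)  # ways to distribute col1 events overall

    def prob(x: int) -> float:
        # Hypergeometric probability of x events in row 1, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties
    return sum(
        p for p in map(prob, range(low, high + 1)) if p <= p_observed * (1 + 1e-9)
    )


# Deaths and survivors in the off-drug follow-up period, as quoted in the text
p_value = fisher_exact_two_sided(17, 356 - 17, 5, 307 - 5)
print(f"two-sided P = {p_value:.3f}")
```

The point of the exercise is that a reader with access only to the counts printed in a study report can verify, or call into question, a significance claim that the report itself omits.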

Attention to Quality: The Best Solution to Quality Problems
It is tempting to think that if research is simply turned over to smart people who have no overt commercial interest in its outcome, concern over the quality of research evidence will become a thing of the past. Such a solution requires little real effort on the part of information consumers; simply ensure that the right people are doing the work and settle back to enjoy the fruits of their labors. However, in an environment in which basic standards for quality of evidence are widely available and yet commonly ignored, this approach suggests abrogation of responsibility by information consumers, as well as peer reviewers and editors. We believe that the solution to the current problems with the quality of peer-reviewed health care literature does not lie solely, or even primarily, in a focus on "whodunnit" but rather in renewed attention to basic standards for quality of evidence. These standards represent nothing more complicated or difficult to understand than lessons repeated in undergraduate-level methodology classes that have been taught for decades, coupled with a little common sense. Using these standards does not take a degree in research methodology or statistics; it takes only the commitment to adhere to them.
As the editors of the Journal of General Internal Medicine sagely observed 3 years ago in discussing the effect of pharmaceutical manufacturer funding on the research literature, "perfection is an asymptote, a goal that can be approximated, but never reached." 16 Yet, to strive for the goal of disseminating only high-quality evidence, it is necessary to assess distance from the goal. Comparisons of research standards with the characteristics of much of the published health care literature suggest that we have a long way to go.
A cynical perspective on health care research, although understandable, is neither inevitable nor wise. What is needed is critical analysis of the quality of the evidence, regardless of who did the research. It is more important to determine what evidence to trust than whom to trust. In managed care, the central focus should be on the customers (the patients, pharmacists, physicians, and other caregivers) who deserve accurate and transparent information about the expected clinical and economic effects of treatment options. All stakeholders, but especially patients, have a lot to lose when studies of treatment options are not validated against accepted standards for conducting and reporting research.